Insights derived from analysis
In this section, we apply visually driven data analysis techniques to answer the questions in the challenge.
Qn 1- Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?
Transaction volume for each location is analyzed using the loyalty and credit card transaction data provided. We also parsed the date from the timestamp in the credit card data so as to align the comparison with the loyalty data, where only the date is provided.
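As a minimal sketch of that parsing step (with a hypothetical two-row sample; the real data carries a POSIXct `timestamp` column as shown in the `glimpse()` output later in this section), the date can be extracted with `lubridate::as_date()`:

```r
library(lubridate)

# Hypothetical two-row sample mimicking the credit card data
cc_sample <- data.frame(
  timestamp = as.POSIXct(c("2014-01-06 07:28:00", "2014-01-06 07:34:00"),
                         tz = "UTC")
)

# Drop the time component so the rows align with the date-only loyalty records
cc_sample$date <- as_date(cc_sample$timestamp)
```

Both datasets can then be compared on the common `date` column.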
Based on the interactive charts, we noted that the top 3 most popular locations by transaction volumes are:
Loyalty card transaction volume
Credit card transaction volume
## Volume from loyalty card dataset by date- linked
shared_data <- SharedData$new(loyalty_data_cleaned, key = ~location)
loyalty_loc_vol_bar <- shared_data %>%
plot_ly() %>%
group_by(location) %>%
summarise(txn_count = n()) %>%
arrange(desc(txn_count)) %>%
add_bars(x = ~txn_count, y = ~location, type = "bar", hoverinfo = "text", text = ~paste("Location:",
location, "<br>Transaction Volume:", txn_count)) %>%
layout(title = "Loyalty card transaction Volume by Location", xaxis = list(title = "Transaction Volume",
zeroline = FALSE), yaxis = list(title = "Location", zeroline = FALSE, categoryorder = "array",
dtick = 1, categoryarray = ~txn_count))
loyalty_loc_vol_line <- shared_data %>%
plot_ly() %>%
group_by(location, date) %>%
summarise(txn_count = n()) %>%
add_lines(x = ~date, y = ~txn_count, hoverinfo = "text", text = ~paste("Location:",
location, "<br>Date:", date, "<br>Transaction Volume:", txn_count)) %>%
layout(xaxis = list(title = "Date", showgrid = TRUE, showticklabels = TRUE),
yaxis = list(title = "Transaction volume"))
bscols(widths = c(14, 14), loyalty_loc_vol_bar, loyalty_loc_vol_line)
## Volume from credit card dataset by date- linked
shared_data_cc <- SharedData$new(cc_data_cleaned, key = ~location)
cc_data_loc_vol_bar <- shared_data_cc %>%
plot_ly() %>%
group_by(location) %>%
summarise(txn_count = n()) %>%
arrange(desc(txn_count)) %>%
add_bars(x = ~txn_count, y = ~location, type = "bar", hoverinfo = "text", text = ~paste("Location:",
location, "<br>Transaction Volume:", txn_count)) %>%
layout(title = "Credit card transaction Volume by Location", xaxis = list(title = "Transaction Volume",
zeroline = FALSE), yaxis = list(title = "Location", zeroline = FALSE, categoryorder = "array",
dtick = 1, categoryarray = ~txn_count))
cc_loc_vol_line <- shared_data_cc %>%
plot_ly() %>%
group_by(location, date) %>%
summarise(txn_count = n()) %>%
add_lines(x = ~date, y = ~txn_count, hoverinfo = "text", text = ~paste("Location:",
location, "<br>Date:", date, "<br>Transaction Volume:", txn_count)) %>%
layout(xaxis = list(title = "Date", showgrid = TRUE, showticklabels = TRUE),
yaxis = list(title = "Transaction volume"))
bscols(widths = c(14, 14), cc_data_loc_vol_bar, cc_loc_vol_line)
Just by comparing the top 3 locations and their popular days, we noted that there are differences in the transaction count per day for each location between the loyalty card and credit card data. In particular, there were days where loyalty card transactions were higher than credit card transactions. This is unexpected as the loyalty card is used to collect discounts and rewards and cannot be used for payment. Hence one would expect both volumes to either be the same or for credit card volumes (actual purchases) to be higher than loyalty card volumes (in cases where the employee may have forgotten to present the loyalty card for rewards/discounts). The difference in volumes each day across both cards is illustrated below. The darker the blue hue, the more the loyalty card transaction volume exceeds the credit card transaction volume.
Some of the notable differences observed include:
1) Kronos Mart- Loyalty card volume exceeded credit card volume by 3 transactions on 01/18/2014
2) Katerina’s Cafe- Credit card volume exceeded loyalty card volume by 5 transactions on 01/18/2014
3) Brew’ve Been Served- Credit card volume exceeded loyalty card volume by 4 transactions on 01/09/2014
## Difference in volume by day
loyalty_data_count <- loyalty_data_cleaned %>%
group_by(location, date) %>%
summarise(txn_count = n()) %>%
arrange(desc(txn_count))
cc_data_count <- cc_data_cleaned %>%
group_by(location, date) %>%
summarise(txn_count = n()) %>%
arrange(desc(txn_count))
count_loyalty_cc_comb <- loyalty_data_count %>%
full_join(cc_data_count, by = c("location", "date")) %>%
rename(loyalty_card_txncount = "txn_count.x", credit_card_txncount = "txn_count.y") %>%
mutate_if(is.numeric, replace_na, replace = 0) %>%
mutate(difference_vol = loyalty_card_txncount - credit_card_txncount)
## Define plot margins; values assumed- m was referenced below but not defined in this chunk
m <- list(l = 120, r = 20, b = 40, t = 40)
a <- list(title = "Difference in volume per day by location", showticklabels = TRUE,
dtick = 1)
txn_vol_difference_heatmap <- count_loyalty_cc_comb %>%
plot_ly(x = ~date, y = ~location, z = ~difference_vol, colors = brewer.pal(3,
"Blues"), type = "heatmap", hoverinfo = "text", text = ~paste("Date:", date,
"<br> Location:", location, "<br> Vol_Diference:", difference_vol))
txn_vol_difference_heatmap %>%
layout(yaxis = a, xaxis = list(dtick = "86400000.0", type = "date", title = "Date",
tickangle = 0, tickformat = "%d%b"), margin = m, plot_bgcolor = "#bdbdbd")
We will then analyze the transaction volume by day of week to observe volume trends across the week.
Loyalty card transaction volume
## Loyalty card volume by day of week
## Preparing the data
loyalty_data_cleaned$daynumber = lubridate::wday(loyalty_data_cleaned$date, week_start = 1)
loyalty_data_cleaned$weekday = factor(loyalty_data_cleaned$weekday, levels = c("Sunday",
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
## Plots
shared_data_loyalty = SharedData$new(loyalty_data_cleaned, key = ~location)
loyalty_wkday_vol <- shared_data_loyalty %>%
plot_ly(width = 900, height = 600) %>%
group_by(weekday, location) %>%
summarize(transaction = n()) %>%
arrange(desc(transaction)) %>%
add_trace(x = ~weekday, y = ~transaction, color = ~location, type = "bar", hoverinfo = "text",
hovertext = ~paste("<br> Day_of_week:", weekday, "<br> Transaction Volume:",
transaction)) %>%
layout(xaxis = list(title = "Day_of_week", type = "factor", categoryorder = "factor",
categoryarray = loyalty_data_cleaned$weekday[order(loyalty_data_cleaned[["daynumber"]])],
showgrid = TRUE, showticklabels = TRUE), yaxis = list(title = "Transaction volume"))
bscols(widths = c(10, 4), list(filter_select("location", "Please Specify the location",
shared_data_loyalty, group = ~location, multiple = F), datatable(shared_data_loyalty) %>%
formatDate(1, method = "toLocaleDateString")), loyalty_wkday_vol)
Given that we are provided with credit card timestamp information, we will go a step further and analyze the volume of credit card transactions by location and time.
Credit card transaction volume
## Credit card volume by day of week
## Loading data
Sys.setlocale("LC_TIME", "English")
[1] "English_United States.1252"
## Preparing the data
cc_data_cleaned$daynumber = lubridate::wday(cc_data_cleaned$timestamp, week_start = 1)
cc_data_cleaned$weekday = factor(cc_data_cleaned$weekday, levels = c("Sunday", "Monday",
"Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
## Preparing timegroup
cc_data_cleaned$timegroup = ""
cc_data_cleaned$timegroup[cc_data_cleaned$hour < 5 & 0 <= cc_data_cleaned$hour] = "00 - 04"
cc_data_cleaned$timegroup[cc_data_cleaned$hour < 9 & 5 <= cc_data_cleaned$hour] = "05 - 08"
cc_data_cleaned$timegroup[cc_data_cleaned$hour < 13 & 9 <= cc_data_cleaned$hour] = "09 - 12"
cc_data_cleaned$timegroup[cc_data_cleaned$hour < 17 & 13 <= cc_data_cleaned$hour] = "13 - 16"
cc_data_cleaned$timegroup[cc_data_cleaned$hour < 21 & 17 <= cc_data_cleaned$hour] = "17 - 20"
cc_data_cleaned$timegroup[cc_data_cleaned$hour <= 24 & 21 <= cc_data_cleaned$hour] = "21 - 24"
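The chained assignments above can be collapsed into a single `cut()` call; this is an equivalent sketch (assuming `hour` runs from 0 to 23), not the code used in the analysis:

```r
# Bin hours 0-23 into the same six groups with one cut() call;
# right = FALSE makes intervals left-closed, so hour 0 lands in "00 - 04"
hour <- 0:23
timegroup <- cut(hour,
                 breaks = c(0, 5, 9, 13, 17, 21, 24),
                 labels = c("00 - 04", "05 - 08", "09 - 12",
                            "13 - 16", "17 - 20", "21 - 24"),
                 right = FALSE, include.lowest = TRUE)
```

The resulting factor carries the bin ordering for free, which also simplifies ordering the bars in the plots.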
## Plots
shared_data = SharedData$new(cc_data_cleaned, key = ~location)
cc_wkday_vol <- shared_data %>%
plot_ly(width = 900, height = 600) %>%
group_by(weekday, location) %>%
summarize(transaction = n()) %>%
arrange(desc(transaction)) %>%
add_trace(x = ~weekday, y = ~transaction, color = ~location, type = "bar", hoverinfo = "text",
hovertext = ~paste("<br> Day_of_week:", weekday, "<br> Transaction Volume:",
transaction)) %>%
layout(xaxis = list(title = "Day_of_week", type = "factor", categoryorder = "factor",
categoryarray = cc_data_cleaned$weekday[order(cc_data_cleaned[["daynumber"]])],
showgrid = TRUE, showticklabels = TRUE), yaxis = list(title = "Transaction volume"))
cc_timegroup_vol <- shared_data %>%
plot_ly(width = 900, height = 600) %>%
group_by(weekday, timegroup, location) %>%
summarize(transaction = n()) %>%
add_trace(x = ~weekday, y = ~transaction, color = ~timegroup, type = "bar", hoverinfo = "text",
hovertext = ~paste("<br> Time:", timegroup, "<br>Transaction Volume:", transaction)) %>%
layout(xaxis = list(title = "Time", type = "factor", categoryorder = "factor",
categoryarray = cc_data_cleaned$weekday[order(cc_data_cleaned[["daynumber"]])],
showgrid = TRUE, showticklabels = TRUE), yaxis = list(title = "Transaction volume"))
bscols(widths = c(10, 4), list(filter_select("location", "Please Specify the location",
shared_data, group = ~location, multiple = F), cc_wkday_vol, cc_timegroup_vol,
datatable(shared_data) %>%
formatDate(1, method = "toLocaleDateString")))
As observed above, the most popular day of the week differs between the two cards for some locations, such as Katerina’s Cafe. This is unexpected as we would expect the trends to be similar for both cards.
Other anomalies relate to the timestamp fields. As the glimpse() output below shows, the loyalty card data records only a date, while the credit card data records a full timestamp:
glimpse(loyalty_data)
Rows: 1,392
Columns: 4
$ timestamp <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
glimpse(cc_data)
Rows: 1,490
Columns: 4
$ timestamp <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
Given that these timestamps may not be representative of the actual transaction time, we will be mindful of using this information for further analysis subsequently.
cc_data_incorrect_time <- cc_data_cleaned %>%
filter(location == "Bean There Done That" | location == "Brewed Awakenings" |
location == "Coffee Shack" | location == "Jack's Magical Beans")
cc_data_incorrect_time$timestamp <- as.character(cc_data_incorrect_time$timestamp)
cc_data_incorrect_time = subset(cc_data_incorrect_time, select = -c(timegroup, day,
hour, daynumber, weekday))
DT::datatable(cc_data_incorrect_time, options = list(pageLength = 10, width = "100%"))
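The suspicious timestamps above were found by inspection; a programmatic check is also possible. The sketch below (with a hypothetical sample, assuming `location` and POSIXct `timestamp` columns) flags locations where every transaction carries the same clock time:

```r
library(dplyr)

# Hypothetical sample: one location always stamped 12:00, one with varied times
txns <- data.frame(
  location  = c("Brewed Awakenings", "Brewed Awakenings", "Katerina's Cafe",
                "Katerina's Cafe"),
  timestamp = as.POSIXct(c("2014-01-06 12:00:00", "2014-01-07 12:00:00",
                           "2014-01-06 13:24:00", "2014-01-06 19:30:00"),
                         tz = "UTC")
)

suspect <- txns %>%
  mutate(clock = format(timestamp, "%H:%M")) %>%
  group_by(location) %>%
  summarise(n_times = n_distinct(clock), n_txn = n()) %>%
  filter(n_times == 1, n_txn > 1)  # all transactions at the same minute
```

Locations surfaced this way can then be excluded from any time-based matching.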
cc_data_kronos <- cc_data_cleaned %>%
filter(location == "Kronos Mart")
cc_data_kronos$timestamp <- as.character(cc_data_kronos$timestamp)
cc_data_kronos = subset(cc_data_kronos, select = -c(timegroup, day, hour, daynumber,
weekday))
DT::datatable(cc_data_kronos, options = list(pageLength = 10, width = "100%"))
Qn 2- Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?
More investigation should be made into this discrepancy. One explanation could be that employees used more than one credit card with their loyalty card. Another could be that a new employee has not yet received a loyalty card.
Given that the distinct employee count is significantly fewer than the number of distinct loyalty and credit cards, we should check with Gastech whether any employees are missing from this list. We would not expect the number of distinct employees and distinct loyalty cards to differ significantly, as it does not make sense to issue an employee multiple loyalty cards.
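A compact way to make that comparison (a sketch with hypothetical miniature tables; the real datasets carry the same identifier columns) is to count distinct values of each identifier:

```r
library(dplyr)

# Hypothetical miniature datasets with the relevant identifier columns
employees <- data.frame(FirstName = c("Ada", "Ben"), LastName = c("Lau", "Ng"))
loyalty   <- data.frame(loyaltynum = c("L2247", "L9406", "L2247"))
cc        <- data.frame(last4ccnum = c(4795, 7108, 4795, 9617))

counts <- c(
  employees = nrow(distinct(employees, FirstName, LastName)),
  loyalty   = n_distinct(loyalty$loyaltynum),
  cc        = n_distinct(cc$last4ccnum)
)
```

A large gap between `counts["employees"]` and the card counts would support following up with Gastech.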
## distinct employee names
distinct_empname <- location_carid_join %>%
drop_na(FirstName) %>%
distinct(FirstName, LastName)
DT::datatable(distinct_empname, filter = "top", options = list(columnDefs = list(list(className = "dt-center",
targets = 0:2, autoWidth = TRUE, scrollX = TRUE))))
Figure 1: Distinct employee names
## distinct loyalty cards
DT::datatable(distinct_loyalty, options = list(columnDefs = list(list(className = "dt-center",
targets = 0:2, width = "20%"))))
Figure 2: Distinct loyalty card numbers
## distinct credit cards
DT::datatable(distinct_cc, options = list(columnDefs = list(list(className = "dt-center",
targets = 0:2, width = "20%"))))
Figure 3: Distinct credit card numbers
The case scenario does not state which CarIDs refer to trucks. However, assuming that three-digit CarIDs represent trucks, we only see GPS data for five trucks. There is no evidence as to whether the truck IDs are sequential or whether each truck driver is assigned to a particular truck. Given that there are 9 truck drivers and GPS data for only 5 trucks, it is possible that: a) truck drivers are not assigned a unique truck and trucks can be shared, or b) GPS data for 4 trucks is missing from the dataset.
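Under that assumption, the IDs in the GPS data can be split with a simple threshold (a sketch with a hypothetical `gps_ids` vector; the real IDs come from the GPS dataset):

```r
# Hypothetical set of CarIDs observed in the GPS data
gps_ids <- c(1, 2, 35, 101, 104, 105, 106, 107)

truck_ids <- gps_ids[gps_ids >= 100]  # three-digit IDs assumed to be trucks
car_ids   <- gps_ids[gps_ids < 100]
```

Comparing `length(truck_ids)` against the nine truck drivers makes the gap explicit.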
To perform further investigation on this, we will plot the GPS paths of each carID over the Abila map to identify their route.
## Employees with no cars assigned
missing_carid <- location_carid_join %>%
filter(is.na(id))
missing_carid = subset(missing_carid, select = -c(lat, long, date, day, hour))
DT::datatable(missing_carid, filter = "top", options = list(autoWidth = TRUE, scrollX = TRUE))
Figure 4: Employees with no assigned car
## Cars not assigned to employees
missing_empname <- location_carid_join %>%
group_by(id, FirstName, LastName) %>%
filter(is.na(FirstName) | is.na(LastName)) %>%
select(id) %>%
unique()
DT::datatable(missing_empname, options = list(columnDefs = list(list(className = "dt-center",
targets = 0:3, width = "25%"))))
Figure 5: CarID with no assigned employee
We will first perform georeferencing using the SHP files provided.
Georeferencing
Download and launch QGIS, an open-source GIS application.
Start a new project by clicking Project > New.
After preparing the georeferencing, we will then import the raster layer into RStudio.
## Import raster layer
bgmap <- raster("data/Geospatial/MC2-tourist.tif")
bgmap
class : RasterLayer
band : 1 (of 3 bands)
dimensions : 1595, 2706, 4316070 (nrow, ncol, ncell)
resolution : 3.16216e-05, 3.16216e-05 (x, y)
extent : 24.82419, 24.90976, 36.04499, 36.09543 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs
source : MC2-tourist.tif
names : MC2.tourist
values : 0, 255 (min, max)
tmap_mode("plot")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255)

Using st_read() of the sf package, import the Abila shapefile into R. We will then convert the aspatial data to a simple feature data frame.
## Import line data
Abila_st <- st_read(dsn = "data/Geospatial", layer = "Abila")
Reading layer `Abila' from data source
`D:\stellaloh91\Assignment\data\Geospatial' using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS: WGS 84
## Converting gps data to Simple Feature Data Frame
gps_data <- gps_data %>%
mutate(day = get_day(gps_data$Timestamp), date = as_date(gps_data$Timestamp),
minute = get_minute(gps_data$Timestamp), day_of_week = weekdays(gps_data$Timestamp),
hour = get_hour(gps_data$Timestamp))
## right = FALSE and include.lowest = TRUE keep hour 0 in the first bin instead of NA
gps_data$timegroup = cut(gps_data$hour, breaks = c(0, 5, 9, 13, 17, 21, 24),
    labels = c("0-4", "5-8", "9-12", "13-16", "17-20", "21-24"), right = FALSE,
    include.lowest = TRUE)
gps_data
# A tibble: 685,169 x 10
Timestamp id lat long day date minute
<dttm> <fct> <dbl> <dbl> <int> <date> <int>
1 2014-01-06 06:28:01 35 36.1 24.9 6 2014-01-06 28
2 2014-01-06 06:28:01 35 36.1 24.9 6 2014-01-06 28
3 2014-01-06 06:28:03 35 36.1 24.9 6 2014-01-06 28
4 2014-01-06 06:28:05 35 36.1 24.9 6 2014-01-06 28
5 2014-01-06 06:28:06 35 36.1 24.9 6 2014-01-06 28
6 2014-01-06 06:28:07 35 36.1 24.9 6 2014-01-06 28
7 2014-01-06 06:28:09 35 36.1 24.9 6 2014-01-06 28
8 2014-01-06 06:28:10 35 36.1 24.9 6 2014-01-06 28
9 2014-01-06 06:28:11 35 36.1 24.9 6 2014-01-06 28
10 2014-01-06 06:28:12 35 36.1 24.9 6 2014-01-06 28
# ... with 685,159 more rows, and 3 more variables:
# day_of_week <chr>, hour <int>, timegroup <fct>
gps_sf <- st_as_sf(gps_data, coords = c("long", "lat"), crs = 4326)
Next, join the gps points into movement paths by using the drivers’ IDs as unique identifiers.
## Creating movement paths from GPS points. group_by() must be followed by
## summarize(); here we summarize using the mean timestamp, though this value
## is not needed
gps_path <- gps_sf %>%
group_by(id, day, timegroup, hour) %>%
summarize(m = mean(Timestamp), do_union = FALSE) %>%
st_cast("LINESTRING") %>%
rename(timestamp = "m")
Checking the data, we noticed line features consisting of a single coordinate pair. The following code chunk identifies and removes these orphan lines.
## Remove one point linestrings
points = npts(gps_path, by_feature = TRUE)
gps_path <- cbind(gps_path, points)
gps_path_cleaned <- gps_path[!(gps_path$points == 1), ]
Lastly, we overlay the selected GPS path onto the background tourist map.
## Plotting GPS paths
gps_path_selected <- gps_path_cleaned %>%
filter(id == 1, day == 6)
tmap_mode("view")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red")
By plotting the GPS coordinates using the Abila tourist map as background, we are able to visualize the path each vehicle is taking. The map is also interactive. Clicking on any point in the trajectory allows us to see the CarID, day and timestamp of the point in the route. This allows us to match the timestamp and location back to the credit card dataset, hence matching the credit card numbers to their corresponding CarID.
## Interactive map- hour 12
gps_path_selected <- gps_path_cleaned %>%
filter(hour == 12)
tmap_mode("view")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red")
## Facet map
gps_path_selected <- gps_path_cleaned %>%
filter(hour == 3)
tmap_mode("plot")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red") +
tm_facets(by = "id", ncol = 3)

Qn3- Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?
By plotting the GPS coordinates using the Abila tourist map as background, we are able to visualize the path each vehicle is taking. The map is interactive: clicking on any point in the trajectory shows the CarID, day and timestamp of that point in the route. This allows us to match the timestamp and location back to the credit card dataset, and hence match the credit card numbers to their corresponding CarID. Alternatively, we can use a facet map by hour of day to identify the location of each car during each hour of the day.
By following the GPS coordinates of each carID, we are then able to identify the time that they arrive at certain points of interest in the map. With this information, we can match the points of interest to the credit card transactions at these locations, based on the timestamp on the GPS data and credit card data. After matching the unique CarID to the credit cards, we can then derive the owner of each card based on the car assignment data. Using a facet map allows us to visualize the route for each car during each hour across each of the 14 days of GPS data in record. This allows for easy observation and matching to the credit card data.
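The matching logic described above can be sketched as a tolerance join between a car's arrival times and the credit card timestamps. The sample rows and the five-minute window below are illustrative assumptions, not the exact procedure used:

```r
# Hypothetical stop for Car 1 and two credit card transactions
stops <- data.frame(
  id       = 1,
  location = "Hallowed Grounds",
  arrival  = as.POSIXct("2014-01-06 07:20:00", tz = "UTC")
)
cc_txn <- data.frame(
  last4ccnum = c(9551, 7108),
  location   = c("Hallowed Grounds", "Brew've Been Served"),
  timestamp  = as.POSIXct(c("2014-01-06 07:22:00", "2014-01-06 07:40:00"),
                          tz = "UTC")
)

# Candidate matches: same location, transaction within 5 minutes of arrival
cand <- merge(stops, cc_txn, by = "location")
cand <- cand[abs(difftime(cand$timestamp, cand$arrival, units = "mins")) <= 5, ]
```

Each surviving row is a candidate CarID-to-card pairing to be confirmed against further days of data.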
We have used an interactive map and facet map below to visualize the route for CarID 1 across 1/6/2014. This allows for easy observation and matching to the credit card data.
## Plotting GPS paths- interactive map
gps_path_selected <- gps_path_cleaned %>%
filter(id == 1, day == 6)
tmap_mode("view")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red")
In this case, we can see that the day begins at 0720, when Car 1 drives towards Hallowed Grounds and stops there for 35 minutes. It then leaves Hallowed Grounds at 0757 and reaches Gastech at 0804. At 1217, Car 1 drives towards Albert’s Fine Clothes, arriving at 1226. It departs at 1325 back to Gastech and arrives at 1334. At 1744, Car 1 leaves Gastech toward the area near Hallowed Grounds, arriving at 1748. At 1936, Car 1 leaves the area near Hallowed Grounds for an area near Albert’s Fine Clothes and arrives at 1949. It then departs at 2027 back to the area near Hallowed Grounds, arriving at 2033. At 2211, Car 1 once again departs the area near Hallowed Grounds for Gastech, arriving at 2215, and only returns to the area near Hallowed Grounds at 0100 on 01/07/2014, arriving at 0114.
## Facet map- by hour
gps_path_selected <- gps_path_cleaned %>%
filter(id == 1, day == 6)
tmap_mode("plot")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red") +
tm_facets(by = "hour", ncol = 3)

By visualizing the route, we can filter the credit card table below for the relevant day, hour and location. The datatable below allows multiple filters, letting us narrow down the list of credit cards that fit the criteria across the day. By repeating this process across the 14 days of credit card transaction and location data provided, we are able to match each credit card to a CarID.
In this case, by filtering the table below for Hallowed Grounds on day 6, hour 7, we noted that the credit card number ending with 9551 is on the list of transactions. Repeating this process for the locations visited throughout the two-week period, we matched Car 1 to the credit card number ending with 9551.
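The same filtering can be expressed in code rather than through the datatable widget (a sketch with a hypothetical slice of the cleaned credit card data, assuming the derived `day` and `hour` columns):

```r
library(dplyr)

# Hypothetical slice of the cleaned credit card data
cc_demo <- data.frame(
  last4ccnum = c(9551, 7108, 9551),
  location   = c("Hallowed Grounds", "Hallowed Grounds", "Katerina's Cafe"),
  day        = c(6, 7, 6),
  hour       = c(7, 7, 13)
)

# Cards transacting at the visited location in the visited hour
candidates <- cc_demo %>%
  filter(location == "Hallowed Grounds", day == 6, hour == 7) %>%
  pull(last4ccnum)
```

Intersecting the candidate sets across several stops narrows each CarID to a single card.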
Based on interactive filtering and visualization of the maps above, we were able to match the GPS trajectories to the credit card data provided, based on location and timestamp. In the process, we noted that some car trajectories match more than one credit card. This is seen for employees in the facilities department (CarID: 29, 100 to 107). We also noted some credit cards without a CarID match.
We have imported the match via a CSV file below.
## Import match
cc_carid_match <- read_csv("data/cc_carid_match.csv")
cc_carid_match$CarID = as.factor(cc_carid_match$CarID)
cc_carid_match$last4ccnum = as.factor(cc_carid_match$last4ccnum)
glimpse(cc_carid_match)
Rows: 55
Columns: 2
$ last4ccnum <fct> 1286, 1310, 1321, 1415, 1874, 1877, 2142, 2276, 2~
$ CarID <fct> 22, 26, 11, 2, 14, 9, 25, 106, NA, 35, 7, NA, 23,~
Using the above dataset, we will be able to identify the owners of each credit card by matching the carID to the car assignment dataset. Scanning through the table, we noted that Ovan Bertrand (Facilities Group Manager) is assigned CarID 29 and uses 2 different credit cards (1- ending with 3547, 2- ending with 5921).
In the car assignment dataset provided, truck drivers are not assigned a particular truck number. Hence we are unable to identify which credit card belongs to which truck driver. We managed to match the trajectories of trucks to 9 different credit cards, which coincides with the number of truck drivers; this reinforces our view that trucks are shared among the truck drivers. With the limited information, we can only match these credit cards to a CarID, not to a specific truck driver.
## Non-truck drivers
cc_carid_emp_match_NT <- cc_carid_match %>%
filter(!(CarID %in% 100:107)) %>%
left_join(assignment_data, "CarID") %>%
arrange(CarID)
DT::datatable(cc_carid_emp_match_NT, filter = "top", fillContainer = T, options = list(scrollX = TRUE,
autoWidth = TRUE, scroller = TRUE, scrollY = "450px"))
## Truck drivers
cc_carid_emp_match_T <- cc_carid_match %>%
filter(CarID %in% 100:107) %>%
arrange(CarID)
cc_carid_emp_match_T
# A tibble: 9 x 2
last4ccnum CarID
<fct> <fct>
1 3506 101
2 9220 101
3 9614 101
4 8642 104
5 7792 105
6 2276 106
7 9152 106
8 4530 107
9 9735 107
After matching the timestamps of credit card transactions to CarIDs using the trajectory visualization, we can then match the credit card and loyalty card owners, using a full join on three matching criteria: 1) date, 2) location and 3) price.
## Match loyalty and credit cards
loyalty_cc_match <- cc_data_cleaned %>%
full_join(loyalty_data_cleaned, by = c("date", "location", "price"))
loyalty_cc_match = subset(loyalty_cc_match, select = -c(day.y, daynumber.y, timestamp,
weekday.y)) %>%
rename(day = "day.x", daynumber = "daynumber.x", weekday = "weekday.x")
DT::datatable(loyalty_cc_match, extensions = "Scroller", fillContainer = T, filter = "top",
options = list(scrollX = TRUE, autoWidth = TRUE, scroller = TRUE, scrollY = "450px")) %>%
formatDate(4, method = "toLocaleDateString")
After the full join, we noted that there are some transactions where the amount recorded on the loyalty card and credit card differs, resulting in NAs in the joined dataset. To correct this, we will drop rows with NAs and group the data to identify which loyalty cards are mapped to each credit card.
## Group by credit card number and loyalty card number
loyalty_cc_match_group <- loyalty_cc_match %>%
group_by(last4ccnum, loyaltynum) %>%
summarise(sumprice = sum(price)) %>%
drop_na("last4ccnum", "loyaltynum")
## Check if each credit card is matched to each loyalty card
loyalty_cc_match_count <- loyalty_cc_match_group %>%
group_by(last4ccnum) %>%
summarise(loyalty_card_count = n()) %>%
arrange(desc(loyalty_card_count))
loyalty_cc_match_count
# A tibble: 55 x 2
last4ccnum loyalty_card_count
<fct> <int>
1 1286 2
2 4795 2
3 4948 2
4 5368 2
5 5921 2
6 7889 2
7 8332 2
8 1310 1
9 1321 1
10 1415 1
# ... with 45 more rows
Based on the above results, we noted that seven credit cards are mapped to more than one loyalty card. These are likely duplicates arising from the method used to match both cards. We will investigate each match individually to identify whether it is a true match or simply a coincidence. To differentiate the former from the latter, we will assess whether several transactions were made with the same combination of credit and loyalty card numbers. If only a single transaction was made with a given combination, it is likely a coincidence. This can be done using the interactive datatable below.
After further investigation, it is noted that several transactions were made with the same combination of credit and loyalty card numbers for the following combination:
- Credit card number ending with 1286, matched with loyalty card numbers L3288 and L3572
Hence the above is unlikely to be a coincidence. Loyalty cards L3288 and L3572 are being used together with the same credit card.
We will then check whether each loyalty card is used with one or multiple credit cards. We see eight loyalty cards matched to multiple credit cards.
To prevent similar coincidental matches from the matching method above, we filter for combinations with more than one transaction using the interactive datatable below. From the table below, we observe that L3288 and L6267 are each used with two different credit cards:
1) Loyalty card number L3288, matched with credit card numbers ending with 1286 and 9241
2) Loyalty card number L6267, matched with credit card numbers ending with 6691 and 6899
## Check for loyalty cards matched to multiple credit card
cc_loyalty_match_count <- cc_data_cleaned %>%
full_join(loyalty_data_cleaned, by = c("date", "location", "price")) %>%
group_by(last4ccnum, loyaltynum) %>%
summarise(txncount = n()) %>%
drop_na("last4ccnum", "loyaltynum") %>%
group_by(loyaltynum) %>%
summarise(cc_count = n()) %>%
arrange(desc(cc_count))
cc_loyalty_match_count
# A tibble: 54 x 2
loyaltynum cc_count
<chr> <int>
1 L2070 2
2 L2247 2
3 L3288 2
4 L3295 2
5 L6119 2
6 L6267 2
7 L8566 2
8 L9406 2
9 L1107 1
10 L1485 1
# ... with 44 more rows
## Check for loyalty cards matched to multiple credit card, with > 1 txn
cc_loyalty_match_obs <- cc_data_cleaned %>%
full_join(loyalty_data_cleaned, by = c("date", "location", "price")) %>%
group_by(last4ccnum, loyaltynum) %>%
summarise(txncount = n()) %>%
filter(txncount > 1) %>%
drop_na("last4ccnum", "loyaltynum") %>%
group_by(loyaltynum) %>%
summarise(cc_count = n()) %>%
arrange(desc(cc_count))
cc_loyalty_match_obs
# A tibble: 54 x 2
loyaltynum cc_count
<chr> <int>
1 L3288 2
2 L6267 2
3 L1107 1
4 L1485 1
5 L1682 1
6 L2070 1
7 L2169 1
8 L2247 1
9 L2343 1
10 L2459 1
# ... with 44 more rows
Given the assumption that combinations of credit and loyalty cards with only one transaction are coincidental, we will drop these duplicate matches.
loyalty_cc_match_cleaned <- loyalty_cc_match[!(loyalty_cc_match$last4ccnum == "4795" &
loyalty_cc_match$loyaltynum == "L2070" | loyalty_cc_match$last4ccnum == "5921" &
loyalty_cc_match$loyaltynum == "L9406" | loyalty_cc_match$last4ccnum == "7889" &
loyalty_cc_match$loyaltynum == "L2247" | loyalty_cc_match$last4ccnum == "4948" &
loyalty_cc_match$loyaltynum == "L3295" | loyalty_cc_match$last4ccnum == "5368" &
loyalty_cc_match$loyaltynum == "L6119" | loyalty_cc_match$last4ccnum == "8332" &
loyalty_cc_match$loyaltynum == "L8566"), ]
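An equivalent way to drop these coincidental pairs, sketched here with `dplyr::anti_join()` rather than the logical filter used above, is to list the pairs once and remove them in a single join (the `matches` table is a hypothetical two-row stand-in for `loyalty_cc_match`):

```r
library(dplyr)

# The six coincidental card pairs identified above
coincidences <- data.frame(
  last4ccnum = c("4795", "5921", "7889", "4948", "5368", "8332"),
  loyaltynum = c("L2070", "L9406", "L2247", "L3295", "L6119", "L8566")
)

# Hypothetical matched table: one coincidental pair and one genuine pair
matches <- data.frame(
  last4ccnum = c("4795", "1286"),
  loyaltynum = c("L2070", "L3288")
)

# Keep only rows whose (credit, loyalty) pair is not in the coincidence list
cleaned <- anti_join(matches, coincidences, by = c("last4ccnum", "loyaltynum"))
```

Listing the pairs in one small table keeps the exclusion list easy to audit and extend.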
We then match the loyalty and credit card data to the car assignment data via the credit card match.
cards_emp_match <- left_join(loyalty_cc_match_cleaned, cc_carid_match, by = "last4ccnum") %>%
drop_na() %>%
left_join(assignment_data, by = "CarID")
Uncertainties in method:
- Assumption that an employee makes a transaction whenever they visit a location. GPS coordinates only indicate that the employee was in a certain area, not that he/she made any purchase there. Hence matching GPS data to transaction data alone may be inaccurate where an employee does not spend, or pays in cash instead of by credit card.
- As the tourist map only marks certain tourist attractions and not all locations in Abila, we are unable to match the full list of transactions from the trajectories seen in the map view to transactions occurring at locations not reflected on the tourist map (e.g. Abila Zacharo). Furthermore, as Abila is relatively small, many of the GPS trajectories for different cars overlap. This may result in incorrect matching, as it becomes guesswork when not all points in a trajectory can be mapped to an actual transaction.
- There are differences in the amounts recorded on the credit card and loyalty card for certain transactions, resulting in NAs in the joined dataset. We removed the rows with NAs and worked with the joined data instead. However, this rests on the assumption that card transactions with the same date, location and price relate to the same transaction. There may be instances of pure coincidence where this is not true, as described above.
- An employee may not have used their assigned car, or may not have made purchases using their own loyalty/credit card.
Uncertainties in data:
- As mentioned above, GPS coordinates are provided for 44 unique CarIDs. However, the number of unique credit and loyalty cards exceeds 44. Hence, we are unable to map the remaining credit and loyalty cards to a CarID.
- Some of the timestamps provided in the credit card data seem inaccurate. The transactions for “Bean There Done That”, “Brewed Awakenings”, “Coffee Shack” and “Jack’s Magical Beans” are all recorded at 12:00. We suspect this does not represent the actual time of the transactions, so it would be inaccurate to match GPS timestamps to the timestamps of these transactions.
- The background story does not specify how truck IDs are distinguished from CarIDs. We have inferred that three-digit IDs are truck IDs.
- Truck drivers are not assigned a particular truck. Hence, even though we are able to match the GPS trajectories of these trucks to certain credit cards, we are unable to match those credit and loyalty cards back to employees, unless we are provided with information on which employee checked out which truck at what time.
We will first create a network graph using the following code chunks.
## Prepare data
cc_carid_emp_match <- cc_carid_match %>%
drop_na() %>%
left_join(assignment_data, by = "CarID")
cc_carid_emp_match$LastName[is.na(cc_carid_emp_match$LastName)] <- "Truck Driver"
cc_carid_emp_match$FirstName[is.na(cc_carid_emp_match$FirstName)] <- "Truck Driver"
cc_carid_emp_match$CurrentEmploymentType[is.na(cc_carid_emp_match$CurrentEmploymentType)] <- "Facilities"
cc_carid_emp_match$CurrentEmploymentTitle[is.na(cc_carid_emp_match$CurrentEmploymentTitle)] <- "Truck Driver"
cc_data_cleaned_matched <- cc_data_cleaned %>%
left_join(cc_carid_emp_match, by = "last4ccnum")
## Create nodes list
sources <- cc_data_cleaned_matched %>%
distinct(last4ccnum) %>%
rename(label = last4ccnum)
destinations <- cc_data_cleaned_matched %>%
distinct(location) %>%
rename(label = location)
## Create single df of unique users and locations
cc_nodes <- full_join(sources, destinations, by = "label")
## Add id column to nodes df
cc_nodes <- cc_nodes %>%
rowid_to_column("id")
cc_nodes <- left_join(cc_nodes, cc_carid_emp_match, by = c(label = "last4ccnum"))
cc_nodes <- cc_nodes %>%
rename(group = CurrentEmploymentType)
## Create edges list
edges <- cc_data_cleaned_matched %>%
group_by(last4ccnum, location, day, hour) %>%
summarise(weight = n()) %>%
ungroup()
edges
# A tibble: 1,490 x 5
last4ccnum location day hour weight
<fct> <chr> <int> <int> <int>
1 1286 Abila Zacharo 6 13 1
2 1286 Abila Zacharo 9 13 1
3 1286 Abila Zacharo 13 13 1
4 1286 Abila Zacharo 16 13 1
5 1286 Ahaggo Museum 18 14 1
6 1286 Brew've Been Served 6 8 1
7 1286 Brew've Been Served 7 7 1
8 1286 Brew've Been Served 8 8 1
9 1286 Brew've Been Served 9 8 1
10 1286 Brew've Been Served 10 8 1
# ... with 1,480 more rows
## Tidy edges list
cc_edges <- edges %>%
left_join(cc_nodes, by = c(last4ccnum = "label")) %>%
rename(from = id)
cc_edges <- cc_edges %>%
left_join(cc_nodes, by = c(location = "label")) %>%
rename(to = id) %>%
mutate(timegroup = "")
cc_edges$timegroup[cc_edges$hour < 7 & 0 <= cc_edges$hour] = "00 - 06"
cc_edges$timegroup[cc_edges$hour < 10 & 7 <= cc_edges$hour] = "07 - 09"
cc_edges$timegroup[cc_edges$hour < 12 & 10 <= cc_edges$hour] = "10 - 11"
cc_edges$timegroup[cc_edges$hour < 15 & 12 <= cc_edges$hour] = "12 - 14"
cc_edges$timegroup[cc_edges$hour < 17 & 15 <= cc_edges$hour] = "15 - 16"
cc_edges$timegroup[cc_edges$hour < 22 & 17 <= cc_edges$hour] = "17 - 21"
cc_edges$timegroup[cc_edges$hour <= 24 & 22 <= cc_edges$hour] = "22 - 24"
## Reorder columns
cc_edges <- select(cc_edges, from, to, day, hour, timegroup, weight)
## Build network graph
cc_graph <- tbl_graph(nodes = cc_nodes, edges = cc_edges, directed = FALSE)
## Facet
cc_edges <- cc_edges %>%
filter(day == 6)
cc_graph <- tbl_graph(nodes = cc_nodes, edges = cc_edges, directed = FALSE)
set_graph_style()
g <- ggraph(cc_graph, layout = "nicely") + geom_edge_link(aes(width = weight), alpha = 0.2) +
scale_edge_width(range = c(0.1, 5)) + geom_node_point(aes(colour = group), size = 2)
g + facet_edges(day ~ timegroup) + th_foreground(foreground = "grey80", border = TRUE) +
theme(legend.position = "bottom")

## Visualizing network metrics- centrality indices
g <- cc_graph %>%
ggraph(layout = "fr") + geom_edge_link(aes(width = weight), alpha = 0.2) + scale_edge_width(range = c(0.1,
5)) + geom_node_point(aes(colour = group, size = centrality_betweenness()))
g + theme_graph()

## Visualizing community
g <- cc_graph %>%
    mutate(community = as.factor(group_optimal(weights = weight))) %>%
    ggraph(layout = "fr") + geom_edge_link(aes(width = weight), alpha = 0.2) +
    scale_edge_width(range = c(0.1, 5)) + geom_node_point(aes(colour = community))
g + theme_graph()
By using the below interactive network graph together with the use of a datatable, we are able to visualize the interactions between parties and locations and use the datatable to filter for the relevant parties and locations to identify whether they were in the same location during the same period.
For example,
## Interactive viz
visNetwork(cc_nodes, cc_edges, height = "800px", width = "100%") %>%
visIgraphLayout(layout = "layout_in_circle") %>%
visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE, selectedBy = "group") %>%
visLegend() %>%
visGroups(groupname = "Security", color = "#FFFFA3") %>%
visGroups(groupname = "Engineering", color = "#FFAFB1") %>%
visGroups(groupname = "Information Technology", color = "#A1EC76") %>%
visGroups(groupname = "Facilities", color = "#F0B3F5") %>%
visGroups(groupname = "Executive", color = "#FF3333") %>%
visLayout(randomSeed = 123)
We have derived the following potential informal relationships between employees.
Desafio Golf Course- Several of the executives congregated at the Desafio Golf Course on Sundays. Willem Vasco-Pais (Environmental Safety Advisor), Ingrid Barranco (SVP/CFO) and Ada Campo-Corrente (SVP/CIO) were present on 12/1/2014 at 1300-1400 and on 19/1/2014 at 1200-1500. Sten Sanjorge Jr (President/CEO) and Orhan Strum (SVP/COO) joined during the second session.
Chostus Hotel- Elsa Orilla (Drill Technician) and Brand Tempestad (Drill Technician) from the Engineering department checked into Chostus Hotel frequently on weekdays during lunch hour, up to four times over the two weeks observed. On further investigation, there are two employees with the last name Orilla. We have no further information on whether they are related. However, if they are partners, this may point to an illicit relationship, given that the visits occur frequently during weekday lunch hours when Kare Orilla is having lunch elsewhere.
Kronos Mart- Transactions were noted at odd hours, such as on 19/1/2014 at 0300, when Varja Lagos (Badging Office), Nils Calixto (IT Helpdesk) and Ada Campo-Corrente (SVP/CIO) all made purchases during the same hour.
Several locations were frequented only by the Facilities department, such as Abila Airport, Kronos Pipe and Irrigation, Nationwide Refinery, Maximum Iron and Steel, Stewart and Sons Fabrication and Carlyle Chemical. Their visits usually do not occur at the same time.
Coffee Shack is visited by only one person with last 4 credit card number 7117 during lunch hour, on most weekdays.
Bean There Done That is popular with the Engineering team during lunch hour on most weekdays, with no transactions by this same group over the weekends.
Brewed Awakenings is favoured by Ingrid Barranco (SVP/CFO), Elsa Orilla (Drill Technician) and Ada Campo-Corrente (SVP/CIO) during lunch hour; they make purchases there frequently during the same periods and would probably have crossed paths here.
Jack’s Magical Beans is favoured by Isak Baza (IT Technician), Isande Borrasca (Drill Technician), Orhan Strum (SVP/COO) and Willem Vasco-Pais (Environmental Safety Advisor) during lunch hour; they make purchases there frequently during the same periods and would probably have crossed paths here.
Isia Vann (Perimeter Control) and Edvard Vann (Perimeter Control) share the same last name. We have no further information on whether they are related. However, on further investigation they appear to have a close relationship, as there were multiple occasions when they made transactions at the same location during the same date and hour.
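Co-occurrences like these can be surfaced with a self-join on (location, day, hour). A minimal base-R sketch on toy data; in practice the join would run over `cc_data_cleaned_matched`:

```r
# Sketch: find pairs of cards that transact at the same location during
# the same day and hour, via a self-merge. Toy data.
tx <- data.frame(last4ccnum = c("1286", "6899", "1286", "6899"),
                 location = c("Guy's Gyros", "Guy's Gyros",
                              "Kronos Mart", "Hippokampos"),
                 day = c(8, 8, 9, 9),
                 hour = c(19, 19, 13, 13))
pairs <- merge(tx, tx, by = c("location", "day", "hour"))
# Keep each unordered pair once and drop self-matches
pairs <- pairs[pairs$last4ccnum.x < pairs$last4ccnum.y, ]
pairs[, c("last4ccnum.x", "last4ccnum.y", "location", "day", "hour")]
```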
Birgitta Frente (Geologist) and Vira Frente (Hydraulic Technician) also share the same last name. We have no further information on whether they are related. However, on further investigation they do not appear to have a particularly close relationship, as there was only one occasion when they made transactions at the same location during the same date and hour- on 13/1/2014 at 1200 at Bean There Done That.
There are other employees with the same last names, such as Minke Mies and Henk Mies, Valeria Morlun and Adan Morlun, and Claudio Hawelon and Benito Hawelon. However, as some of them belong to the facilities truck driver team, where data limitations prevent us from matching each CarID to an employee name, we are unable to identify these truck drivers’ routes to determine whether they were in the same location as the employees with the same last names.
We have identified the following suspicious activities:
From the facet graph below, we observe that Nils Calixto (IT Helpdesk, CarID 1) returned to the office during the period 2100-2400 on 08/01/2014, 15/01/2014 and 17/01/2014.
Filtering the datatable below, we also observe that Nils Calixto spent $10,000 at Frydos Autosupply n’ More. This expenditure was identified as an outlier during our earlier exploratory analysis; the large transaction amount further raises suspicion.
## Facet map- by day
gps_path_selected <- gps_path_cleaned %>%
filter(id == 1, timegroup == "21-24")
tmap_mode("plot")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red") +
tm_facets(by = "day", ncol = 3)
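One common way to flag amounts like the $10,000 charge above is the 1.5×IQR boxplot rule (the exploratory analysis need not have used exactly this rule; it is shown here as one reasonable sketch, on toy prices rather than the actual dataset):

```r
# Sketch: flag unusually large transaction amounts with the 1.5*IQR
# boxplot rule. Toy prices; the real analysis would use cc_data_cleaned.
price <- c(5.2, 8.9, 12.5, 20, 31.5, 10000)
q <- quantile(price, c(0.25, 0.75))
upper <- q[2] + 1.5 * (q[2] - q[1])   # upper whisker threshold
outliers <- price[price > upper]
outliers
```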

The employees who attended the gathering seem to be mainly from the Engineering and Information Technology departments:
- Nils Calixto (IT Helpdesk)- CarID 1
- Isak Baza (IT Technician)- CarID 5
- Lucas Alcazar (IT Technician)- CarID 8
- Gustav Cazar (Drill Technician)- CarID 9
- Axel Calzas (Hydraulic Technician)- CarID 11
- Lidelse Dedos (Engineering Group Manager)- CarID 14
- Vira Frente (Hydraulic Technician)- CarID 19
- Kanon Herrero (Geologist)- CarID 25
- Brand Tempestad (Drill Technician)- CarID 33
- Felix Balas (Engineer)- CarID 3 (left later, at 11/1/2014 0000)
- Marin Onda (Drill Site Manager)- CarID 26 (left later, at 11/1/2014 0000)
## Facet map- by id
gps_path_selected <- gps_path_cleaned %>%
filter(day == 10, hour == 23)
tmap_mode("plot")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red") +
tm_facets(by = "id", ncol = 3)

## Facet map- by id
gps_path_selected <- gps_path_cleaned %>%
filter(day == 11, hour == 0)
tmap_mode("plot")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red") +
tm_facets(by = "id", ncol = 3)

## Interactive map
gps_path_selected <- gps_path_cleaned %>%
filter(id == 16, day == 10, hour == 23)
tmap_mode("view")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red")
Meeting at Spetson Park at night between Loreto Bodrogi and Isia Vann
Meeting at Taxiarchon Park at night between Loreto Bodrogi and Minke Mies
On 11/01/2014, Bertrand Ovan visited a sequence of places without making any transactions. He left his house at 2211 and arrived at Brew’ve Been Served at 2212. After staying for 9 minutes, he went to Ouzeri Elian, arriving at 2227. He left at 2234 and headed for a place near Kronos Mart, arriving at 2240. 15 minutes later, he went to Albert’s Fine Clothing, arriving at 2258, and stayed there for another 23 minutes. He then drove to an area near U-Pump or Jack’s Magical Beans, leaving at 2321 and arriving at 2325. He stayed there for about 30 minutes and arrived home at exactly 2359. This is quite unusual, as he visited several locations without making any transactions.
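The dwell times in this account come from differencing arrival and departure timestamps at each stop. A minimal sketch, with times copied from the narrative above and column names assumed:

```r
# Sketch: derive stop durations from ordered arrival/departure times,
# as used to reconstruct Bertrand Ovan's route on 11/01/2014.
stops <- data.frame(
  place = c("Brew've Been Served", "Ouzeri Elian", "near Kronos Mart",
            "Albert's Fine Clothing"),
  arrive = as.POSIXct(c("2014-01-11 22:12", "2014-01-11 22:27",
                        "2014-01-11 22:40", "2014-01-11 22:58")),
  depart = as.POSIXct(c("2014-01-11 22:21", "2014-01-11 22:34",
                        "2014-01-11 22:55", "2014-01-11 23:21")))
stops$dwell_min <- as.numeric(difftime(stops$depart, stops$arrive,
                                       units = "mins"))
stops[, c("place", "dwell_min")]
```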
Visiting a similar location late at night- Isia Vann (Perimeter Control, CarID 16) left her house at 2300 on 10/1/2014 and arrived at the house near the Ahaggo Museum at 2320. She left only at 0323 on 11/1/2014 and drove home, arriving at 0334.
Hennie Osvaldo’s residences
Hennie Osvaldo seems to have two residences- 1) an area next to Guy’s Gyros, and 2) an area near Frydos Autosupply n’ More. He typically returns to the second house on weekdays and the first house on weekends, with the exception of 08/01/2014.
## Facet map- by day
gps_path_selected <- gps_path_cleaned %>%
filter(id == 21, timegroup == "21-24")
tmap_mode("plot")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red") +
tm_facets(by = "day", ncol = 3)
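One way to support the two-residence inference above is to tabulate where each day's final GPS fix lands, split by weekday versus weekend. A toy sketch; in practice the end-of-day location would come from the last point per day in `gps_path_cleaned`:

```r
# Sketch: infer a likely residence from each day's final GPS location,
# split by weekday vs weekend. Toy data for one car over one week.
end_of_day <- data.frame(
  day = as.Date("2014-01-06") + 0:6,   # Mon 6 Jan .. Sun 12 Jan 2014
  loc = c(rep("near Frydos Autosupply", 5), rep("near Guy's Gyros", 2)))
# %u gives ISO weekday: Mon = 1 .. Sun = 7
end_of_day$weekend <- format(end_of_day$day, "%u") %in% c("6", "7")
table(end_of_day$loc, end_of_day$weekend)
```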

As discussed above, we also noted that Elsa Orilla (Drill Technician) and Brand Tempestad (Drill Technician) frequently visited Chostus Hotel on weekdays during lunch hour, up to four times over the two weeks observed. On further investigation, there are two employees with the last name Orilla. We have no further information on whether they are related. However, if they are partners, this may point to an illicit relationship, given that the visits occur frequently during weekday lunch hours when Kare Orilla is having lunch elsewhere. We also noticed that the two tend to stagger their departures from the location by about 10 minutes.
## Facet map- by id
gps_path_selected <- gps_path_cleaned %>%
filter(id == 7, timegroup == "8-12")
tmap_mode("plot")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red") +
tm_facets(by = "day", ncol = 3)

## Facet map- by id
gps_path_selected <- gps_path_cleaned %>%
filter(id == 7, timegroup == "13-16")
tmap_mode("plot")
tm_shape(bgmap) + tm_rgb(bgmap, r = 1, g = 2, b = 3, alpha = NA, saturation = 1,
interpolate = TRUE, max.value = 255) + tm_shape(gps_path_selected) + tm_lines(col = "red") +
tm_facets(by = "day", ncol = 3)
